Plan for the Talk


* Why Web Archives Matter 
* Key terms 
* The broader research landscape 


* Web Archives are Awesome
* How Can We Use Them?


* Community infrastructure 








The Problem with Web 
Archives 














To me, the problem
isn’t that we can’t
collect enough data.


* For example: the Internet Archive has 635 billion+
URLs and 40 PB of unique data (and non-Internet
Archive collectors probably have about the same
again).





But rather that we 
struggle with what to 
DO with it all. 


We collect all this data, but what 
happens when it comes time to analyze 
it? 


The Wayback
Machine isn’t always
enough


* Wayback Machine is great if you know what
you're looking for:

* Keyword search gets better and better;

* Diff functions are cool;


* But not great for detailed queries:
* i.e. websites that say X and link to Y;
* Exploratory text mining;
* Working with images or video en masse;
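A detailed query of the "say X and link to Y" kind can be sketched in a few lines of Python. This is an illustration only: the `Page` record, its fields, and the sample data are hypothetical and do not correspond to any format a web archive tool actually produces.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class Page:
    """Hypothetical record for one archived page (not a real archive format)."""
    url: str
    text: str
    links: list = field(default_factory=list)


def says_x_and_links_to_y(pages, phrase, target_domain):
    """Return pages whose text contains `phrase` and that link to `target_domain`."""
    phrase = phrase.lower()
    matches = []
    for page in pages:
        if phrase not in page.text.lower():
            continue
        # Compare on the host part of each outgoing link.
        hosts = {urlparse(link).netloc.lower() for link in page.links}
        if target_domain.lower() in hosts:
            matches.append(page)
    return matches


# Made-up sample data for illustration only.
SAMPLE_PAGES = [
    Page("http://a.example/", "Stop SOPA now", ["http://y.example/petition"]),
    Page("http://b.example/", "Stop SOPA now", ["http://z.example/"]),
    Page("http://c.example/", "Weather report", ["http://y.example/"]),
]

if __name__ == "__main__":
    for page in says_x_and_links_to_y(SAMPLE_PAGES, "sopa", "y.example"):
        print(page.url)
```

The point of the sketch is that such a query needs the page text and the outgoing links side by side, which page-at-a-time browsing does not offer.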





WARC file 


WARC record 





So instead, we work 
with WARCs. 


* The WARC (ISO 28500:2009) file
* Pictured at right















* Hard to use and a bit idiosyncratic, with a
smaller user base, so the first step is usually to
transform the data into something that's a bit
more common!





WARC/1.0
WARC-Type: resource
WARC-Target-URI: file://var/www/htdoc/images/logo.jpg
WARC-Date: 2006-09-19T17:20:24Z
WARC-Record-ID: <urn:uuid:92283950-ef2f-4d72-b224-f54cbec90bb0>
Content-Type: image/jpeg
WARC-Payload-Digest: sha1:CCHXETFVJD2MUZY6ND6SS7ZENMWF7KQ2
WARC-Block-Digest: sha1:CCHXETFVJD2MUZY6ND6SS7ZENMWF7KQ2
Content-Length: 1662
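
To make the record structure concrete, here is a minimal Python sketch that splits a WARC record header like the one above into its version line and named fields. It is not a full ISO 28500 parser: `parse_warc_header` is a hypothetical helper, it assumes one `Name: value` pair per line, and it ignores the record body entirely.

```python
def parse_warc_header(header_text):
    """Split a WARC record header into (version, fields).

    Assumes one 'Name: value' pair per line and no folded (continuation) lines.
    """
    lines = [line.strip() for line in header_text.strip().splitlines()]
    lines = [line for line in lines if line]
    version = lines[0]
    if not version.startswith("WARC/"):
        raise ValueError("not a WARC record header: %r" % version)
    fields = {}
    for line in lines[1:]:
        # Split on the first colon only, so timestamps and URIs stay intact.
        name, sep, value = line.partition(":")
        if not sep:
            raise ValueError("malformed header line: %r" % line)
        fields[name.strip()] = value.strip()
    return version, fields


SAMPLE_HEADER = """\
WARC/1.0
WARC-Type: resource
WARC-Date: 2006-09-19T17:20:24Z
Content-Type: image/jpeg
Content-Length: 1662
"""

if __name__ == "__main__":
    version, fields = parse_warc_header(SAMPLE_HEADER)
    print(version, fields["WARC-Type"], int(fields["Content-Length"]))
```

In practice an established WARC library would do this work; the sketch only shows that a record header is a small set of named fields.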

Working with WARCs means


So why would you 
ever want to do this? 





Because web archives are awesome! 





Why are they so awesome? 


Scope: Data that never 
before would have been 

Could you do a study of the 1990s
without web archives? 


Political historians, social historians, economic historians, cultural historians, etc. etc. etc.


So we have a serious problem.. 







On the one hand, 
researchers will 
need to use web 
archives; 















On the other hand, 
researchers can’t 
analyze them as the 
tools and supports 
aren't there. 











WHAT CAN WE DO?





This requires a few different things 


So let’s take stock.

Web archives exist;
but analyzing them

Existing tools to
analyze them often
require
programming skills.

Historians need to
do this, but the
skills simply are
maybe too much.











WE’VE BEEN
WORKING ON THIS
PROBLEM SINCE [illegible].

Archives Unleashed


WE STARTED WITH THE
“ARCHIVES UNLEASHED
TOOLKIT”


CHECK OUT OUR CUTTING-EDGE
INTERFACE.

DRUMROLL PLEASE





The Archives Unleashed Toolkit

It’s easy to use, as long as you:

Know how to use the command line;
How to access a server;
How to use the Spark Shell;
How to code, at least somewhat, in Scala;
And, have a lot of patience for open-source documentation!

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_161)
Type in expressions to have them evaluated.
Type :help for more information.

scala> :paste
// Entering paste mode (ctrl-D to finish)

import io.archivesunleashed._
import io.archivesunleashed.matchbox._

val r = RecordLoader.loadArchives("example.arc.gz", sc)
  .keepValidPages()
  .map(r => ExtractDomain(r.getUrl))
  .countItems()
  .take(10)

(OK, it's not easy to use..)
So the problem is still here...

Need to have skills;
web archives;
Existing tools aren't enough.
But we want to do this!


SO, WE NEED MORE
COMMUNITY AND
INFRASTRUCTURE.














WARCs are awesome, but honestly, the user
community is too small.

You need to use DERIVATIVE datasets.





Derivative 
Datasets 




Extracting text 


Extracting hyperlinks 


Extracting images 


Extracting binary files 


Extracting anything and turning it into 
something usable 
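
As one illustration of what "extracting hyperlinks" involves, here is a minimal sketch using only the Python standard library. It is not how the Archives Unleashed tools do it: `LinkExtractor` and `extract_links` are hypothetical names, and the sample HTML is made up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute link targets from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    """Return (source, target) pairs, one per hyperlink found in `html`."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return [(base_url, target) for target in parser.links]


SAMPLE_HTML = '<p><a href="/about">About</a> <a href="http://other.example/">Other</a></p>'

if __name__ == "__main__":
    for source, target in extract_links("http://site.example/index.html", SAMPLE_HTML):
        print(source, "->", target)
```

Run over every page in a collection, pairs like these become a link table that can be opened in ordinary data tools, which is the kind of usable output a derivative dataset aims for.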


Too big a gap between tools and users... 


RESEARCHERS ARE OVER HERE..

AND THE WEB ARCHIVES ARE OVER HERE..





Archives Research
Compute Hub, or ARCH
webdata.archive-it.org/ait/

ARCH: Archive Research Compute Hub

Login
To login with your Archive-It account, either prefix your
username with ait: or use the Archive-It login page.





But what do you do with these
datasets?
arch-example.ipynb
colab.research.google.com/github/archivesunleashed/notebo-

Working with ARCH Derivatives


In this notebook we'll download some example derivatives from Archive-It's ARCH service to demonstrate a few examples of further exploration 
of web archive data. 





Datasets


First, we will need to download some derivative data from ARCH. 


For this notebook we'll be using derivatives created from the SOPA Blackout collection created by the Internet Archive Global Events Team. 


[1]

%%capture

!mkdir data

!curl "https://webdata.archive-it.org/ait/files/download/ARCHIVEIT-03010/AudioInformationExtraction/audio-information.
!curl "https://webdata.archive-it.org/ait/files/download/ARCHIVEIT-03010/DomainFrequencyExtraction/domain-frequency.cs
!curl "https://webdata.archive-it.org/ait/files/download/ARCHIVEIT-03010/DomainGraphExtraction/domain-graph.csv.gz?acc
!curl "https://webdata.archive-it.org/ait/files/download/ARCHIVEIT-03010/ImageGraphExtraction/image-graph.csv.gz?acces
!curl "https://webdata.archive-it.org/ait/files/download/ARCHIVEIT-03010/ImageInformationExtraction/image-information.
!curl "https://webdata.archive-it.org/ait/files/download/ARCHIVEIT-03010/PdfInformationExtraction/pdf-information.csv.
!curl "https://webdata.archive-it.org/ait/files/download/ARCHIVEIT-03010/PresentationProgramInformationExtraction/powe
!curl "https://webdata.archive-it.org/ait/files/download/ARCHIVEIT-03010/SpreadsheetInformationExtraction/spreadsheet-
!curl "https://webdata.archive-it.org/ait/files/download/ARCHIVEIT-03010/VideoInformationExtraction/video-information.
!curl "https://webdata.archive-it.org/ait/files/download/ARCHIVEIT-03010/WebGraphExtraction/web-graph.csv.gz?access=AA
!curl "https://webdata.archive-it.org/ait/files/download/ARCHIVEIT-03010/WebPagesExtraction/web-pages.csv.gz?access=X$
!curl "https://webdata.archive-it.org/ait/files/download/ARCHIVEIT-03010/WordProcessorInformationExtraction/word-docun






































Unzip the data. 


[2] 


!gunzip data/* 
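
Once unzipped, the derivatives are plain CSV files, so ordinary tooling applies. The sketch below ranks domains by frequency with Python's `csv` module. The column names `domain` and `count` are assumptions (check the header row of the file you actually downloaded), and the inline sample rows are made up so that the example runs on its own.

```python
import csv
import io


def top_domains(csv_file, n=3):
    """Return the n most frequent (domain, count) rows from a CSV file object.

    Assumes a header row with 'domain' and 'count' columns (an assumption;
    the real derivative may label its columns differently).
    """
    rows = [(row["domain"], int(row["count"])) for row in csv.DictReader(csv_file)]
    rows.sort(key=lambda pair: pair[1], reverse=True)
    return rows[:n]


# Made-up rows standing in for a real derivative file.
SAMPLE_CSV = "domain,count\nexample.org,120\nexample.com,45\nexample.net,300\n"

if __name__ == "__main__":
    # For a real file, pass an opened file instead, e.g.
    # open(path_to_csv, newline="") where path_to_csv is your download.
    for domain, count in top_domains(io.StringIO(SAMPLE_CSV), n=2):
        print(domain, count)
```

Nothing here is specific to web archives, which is the appeal of derivatives: the analysis step needs only general data skills.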







We also want to build community. 








Datathons 


* Lowering barriers


* Bringing people together (physically, although 
our last one was virtual) 


* Establishing a community of practice to work 
with web archives 


2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL) 


Building Community and Tools for Analyzing 
Web Archives through Datathons 


Ian Milligan, Nathalie Casemajor, Samantha Fritz, Jimmy Lin,
Nick Ruest, Matthew S. Weber, and Nicholas Worby
University of Waterloo; INRS; York University Libraries;
University of Minnesota


ABSTRACT 


Starting in March 2016, the Archives Unleashed team and our
collaborators have brought together social scientists, humanists,
archivists, librarians, computer scientists, and other stakeholders
to explore web archives as research objects. Three objectives
motivated our team to develop and organize these events: facilitating
scholarly access, community building, and skills training. We
believe that we have been successful on all three fronts. For each
event, over the course of two to three days, participants formed
interdisciplinary teams and explored web archives using a variety
of methods and tools. This paper details our experiences in
designing these “datathons”, with an intent to share lessons learned,
highlight interdisciplinary approaches to research and education
on web archives, and describe future opportunities.


1 INTRODUCTION 
Starting in March 2016, we have brought together disparate groups
that include those who create web archives, those who create tools
and platforms, and those who use them for research. Each of these
“datathons” has brought together twenty to fifty individuals, and
over the course of two to three days, participants formed
interdisciplinary teams and were given access to data and computing
infrastructure to develop a project around a web archive collection.
These events have resulted in expanding participants’ knowledge
of methods, tools, and approaches to tackling web archive data
at scale. Our datathons have three primary objectives: facilitating
scholarly access, community building, and skills training. Most
notably, the community-building element has seen us connect, build
relationships, and develop the social infrastructure for a burgeoning
network of individuals and groups held together by the common
goal of exploring web archives as research objects.

By reflecting on these datathons, we present an approach to tools
development and community building that can accomplish several
goals. We begin with a descriptive characterization of these events
and then articulate the contributions that they have made to the
study of web archives, in three main ways. These include:

* The tangible development of tools and platforms that meet
demonstrated needs (i.e. better support for scholarly inquiry, as
identified in our original proposals for these events);

* A better understanding of the processes by which scholars,
curators, and others work with these materials, providing a reference
workflow with which to evaluate future research tools;

* The building of a community, in part supported by the continued
use of datathon communication channels and standing
infrastructure, as well as encouragement to attend follow-up events.


978-1-7281-1547-4/19/$31.00 ©2019 IEEE
DOI 10.1109/JCDL.2019.00044

Finally, through feedback and an iterative process of designing
events, surveying participants, and starting anew, we provide
generalizable lessons around running datathons in the digital library
and cultural heritage environment.


2 CONTEXT AND BACKGROUND 
Big data has the potential to reshape humanistic and social science
research. The sheer amount of cultural information that is
generated and, crucially, preserved every day in electronic form presents
exciting new opportunities for historians [6]. Much of this
information is captured within web archives, which contain hundreds
of billions of web pages, ranging from individual homepages and
social media posts to institutional websites.

Web archives provide an opportunity for researchers and
scholars in the humanities and social sciences, as they become an
access point to reconstruct large-scale traces of the relatively recent
past [8, 9]. Simply put, web data enhances research topics that date
back to the mid-1990s; this is not for those studying the web per se,
but for those examining social and cultural activities taking place
in an era of born-digital web sources.

Yet the opportunity to explore information and artifacts
presented in web archives is hindered by several challenges, most
notable of which stems from the need to process and analyze the
sheer amount of data currently available. We have more
accumulated data than ever before, and the rate at which we are capturing
potentially valuable historical data is accelerating. But the scale is
overwhelming. For serious research, we need to develop new tools
and methods to make sense of this digital deluge, a point which has
been explored in several papers and workshops at JCDL [2-5, 10].

The size of these archives eludes traditional finding aids and
requires more than the ability to examine individual source
documents. Today, scholars are mostly limited to viewing one page
at a time in a web archive. In addition, web archives are currently
underused because of the high barriers to entry. When a scholar
approaches a web archive for the first time, she often has little idea
where to start. Yet the need for access is very real. We have
participated in numerous web archiving conferences and events
(International Internet Preservation Consortium meetings; the Research
Infrastructure for the Study of Archived Web Materials conferences;
and many others), which have enabled us to become familiar with
this community. Conversations with colleagues at these venues
have revealed several key, recurring requirements among scholars,
which we attempt to tackle with our datathons.

It is clear that current tools for accessing archived web content
present challenges for scholars. While the Internet Archive makes
archived web content available to the general public through its


In datathons, we helped people
catch the fish, but then they went
[illegible]
learn how to fish.


(To use a tired metaphor) 








With the “Cohorts” we aim
to build a sustainable web
archiving research
ecosystem.


















Cohorts 











Applications open until Thursday.. 
But if you are interested in joining 
us, email me and we can hook you 
up with an extension. 


A pretty lightweight application 
process. 





Archives Unleashed Cohort Program
2022-2023

Supporting research engagement with web archives

Participate in a year-long intensive collaboration
and receive mentorship from Archives Unleashed

* Opening + Closing events
* Virtual peer-support gatherings
* Travel support for cohort events
* Bi-monthly mentorship and support meetings
* Access to ARCH (Archives Research Compute Hub)
* $11,500 in funding to support project work
* Produce a full-length manuscript

How to Apply

* Assemble a team
* Find THE dataset
* Submit application

Important Dates
Submissions

* Applications Due: 31 March 2022
* [illegible]: April 2022
* Projects Begin: 4 July 2022
* Projects End: 30 June 2023

Cohort Applications available at
https://archivesunleashed.org/cohorts2022-2023/

Applications are due by
11:59PM EST on 31 March 2022

But ultimately, we are motivated by three
key points: 


This will come in 
several shapes and sizes 


Part of it is new, usable tools (i.e. toolmakers 
coming towards researchers) 


Part of it is new cultures in the humanities and 
social sciences (i.e. researchers coming towards 
toolmakers) 


But overall, we need a shared vision that this sort 
of work is important. That we can’t do what we 
want to do unless we find some sort of common 
ground. 

But somehow, we need to shrink this gap,
and find a good place to meet.

RESEARCHERS ARE OVER HERE..

AND THE WEB ARCHIVES ARE OVER HERE..





IF WE CAN DO 
THIS, IT WILL BE 
WORTH IT. 





THANKS TO OUR 
FUNDERS! 





And thanks to you! 


i2millig@uwaterloo.ca 





